Abstract. One of the core applications of machine learning to knowledge discovery
consists of building a function (a hypothesis) from a given amount of data (for instance
a decision tree or a neural network) such that we can use it afterwards to predict new
instances of the data. In this paper, we focus on a particular situation where we assume
that the hypothesis we want to use for prediction is very simple, and thus the hypothesis
class is of feasible size. We study the problem of how to determine which of the hypotheses
in the class is almost the best one. We present two on-line sampling algorithms for
selecting hypotheses, give theoretical bounds on the number of necessary examples, and
analyze them experimentally. We compare them with the simple batch sampling approach
commonly used and show that in most situations our algorithms use far fewer examples.

1 Introduction and Motivation 

The ubiquity of computers in business and commerce has led to the generation of huge
quantities of stored data. A simple commercial transaction, phone call, or use of a credit
card is usually stored in a computer. Today's databases are growing in size, and therefore
there is a clear need for automatic tools for analyzing and understanding these data.
The field known as knowledge discovery and data mining aims at understanding and
developing all the issues concerned with the extraction of patterns from vast amounts of
data. Some of the techniques used are basically machine learning techniques. However,
because the available data is very large, many machine learning techniques do not scale
well and cannot simply be applied as they are.

One of the core applications of machine learning to knowledge discovery consists of
building a function from a given amount of data (for instance a decision tree or a neural
network) such that we can later use it to predict the behavior of new instances of the
data. This is commonly known as concept learning or supervised learning.

Most of the previous research in machine learning has focused on developing efficient 
techniques for obtaining highly accurate predictors. For achieving high accuracy, it is 
better that learning algorithms can handle complicated predictors, and developing efficient 
algorithms for complicated predictors has been studied intensively in machine learning. 

On the other hand, for knowledge discovery, there are some other aspects of concept
learning that should be considered, and in this paper we discuss one of them. We study
concept learning (or, more simply, hypothesis selection) for a particular situation that
we describe in the following. We assume that we have a class $\mathcal{H}$ of very
simple hypotheses, and we want to select one of the reasonably accurate hypotheses from
it by using a given set of data, i.e., labeled examples. Since the hypotheses we deal with
are very simple, we cannot hope, in general, to find highly accurate hypotheses in
$\mathcal{H}$. On the other hand, the size of the hypothesis space $\mathcal{H}$ is
relatively small and feasible. We also assume that the size of the data available is huge,
and thus it is very inefficient to use all examples in the dataset. Simple hypotheses have
been studied before by several researchers, and it has been reported that in some cases
they can achieve surprisingly high accuracy (see, e.g., [?]). Moreover, with the recent
discovery of voting methods like boosting [?], bagging [?], or error-correcting output
codes [?], several of these hypotheses can be combined in a way that the overall precision
becomes extremely high.

Perhaps the paper by Holte [5] best exemplifies our problem. In that paper he performs
several experiments with some datasets from the repository of the University of California
at Irvine. His learning algorithm is extremely simple: it obtains a training set from the
dataset, builds a set of very simple hypotheses according to the different features of the
dataset (see the paper for more details on how to build the set of simple hypotheses), and
then selects the hypothesis that has the highest accuracy on the training set. It turns out
that this simple approach is indeed effective, since for most of the datasets the accuracy
is between 80 and 90 percent. His choice of training set size, 2/3 of the whole dataset, is
arbitrary. If the available dataset is huge, as happens in many situations, then this
choice might be very inefficient.

On the other hand, the obvious approach for solving this problem, commonly used in
computational learning theory [?], is to first choose randomly a certain number
$m$ of examples from the dataset, and then select the hypothesis that performs best on
these examples. (We will call this simple hypothesis selection Batch Selection (BS) in
this paper.) The number $m$ is calculated so that the best hypothesis on the selected
sample is close to the real best one with high probability; such $m$ can be calculated by
using uniform convergence bounds like the Chernoff or the Hoeffding bound (see, e.g.,
[?] for some examples of this approach). However, if we want to apply this method in a
real setting we will encounter two problems. First, the theoretical bounds are usually too
pessimistic, and thus the sample sizes obtained are not practical. Second, to obtain these
bounds we need to have certain knowledge about the accuracy of hypotheses in a given
hypothesis space. What is usually assumed is that we know a lower bound on the accuracy of
the best hypothesis. Again, this lower bound might be far from the real accuracy of the best
hypothesis, and thus the theoretical bound becomes too pessimistic. Or even worse, in
many applications we just do not know anything about the accuracy of the hypotheses.

In this paper we propose two algorithms for solving this problem, obtain theoretical
bounds on their performance, and evaluate them experimentally. Our goal is to obtain
algorithms that are useful in practice but that also have certain theoretical guarantees
about their performance. The first distinct characteristic is that we obtain the examples
in an on-line manner rather than in batch. The second is that the number of examples
depends less on the lower bound of the accuracy than in the obvious Batch Selection
above. More specifically, if $1/2 + \gamma_0$ is the accuracy of the best hypothesis, and
$\gamma$ is the lower bound for $\gamma_0$ we would use, then the sample size $m$ for Batch
Selection given by the theoretical bound is $O(1/\gamma^2)$ (ignoring dependencies on other
parameters). On the other hand, the sample size of our first algorithm is
$O(1/(\gamma\gamma_0))$, and that of the second one is $O(1/\gamma_0^2)$.

The paper is organized as follows. In the following section we give some definitions. In 
Section 3 we state the two selection algorithms and prove their performance theoretically. 
In the last section we compare and analyze them experimentally. 

2 Preliminaries 

Throughout this paper, we use $\mathcal{H}$ and $n$ to denote the set of hypotheses and its
size, and use $\mathcal{D}$ to denote a distribution on instances. We assume some oracle
$EX_{\mathcal{D}}()$ that generates instances according to the distribution $\mathcal{D}$, and
each selection algorithm can make use of $EX_{\mathcal{D}}()$. For any $h \in \mathcal{H}$, let
$\mathrm{prc}_{\mathcal{D}}(h)$ denote the accuracy of $h$, that is, the probability that
$h$ gives a correct prediction on $x$ for a randomly given $x$ under the distribution
$\mathcal{D}$. Let $h_0$ denote the best hypothesis in $\mathcal{H}$ (w.r.t. $\mathcal{D}$);
that is, $\mathrm{prc}_{\mathcal{D}}(h_0) = \max\{\mathrm{prc}_{\mathcal{D}}(h) \mid h \in \mathcal{H}\}$.
Let $\gamma_0$ denote $\mathrm{prc}_{\mathcal{D}}(h_0) - 1/2$; that is,
$\mathrm{prc}_{\mathcal{D}}(h_0) = 1/2 + \gamma_0$.

We use $p_{upper}$ and $p_{lower}$ to denote upper and lower tail probabilities of independent
Bernoulli trials. More specifically, for any $t \ge 1$ and $p$, $0 \le p \le 1$, consider $t$
independent random variables $X_1, \ldots, X_t$, each of which takes the values 0 and 1 with
probability $1 - p$ and $p$. Then for any $\epsilon > 0$, we define $p_{upper}(p, \epsilon, t)$
and $p_{lower}(p, \epsilon, t)$ as follows:

$$p_{upper}(p, \epsilon, t) = \Pr\Big\{ \sum_{i=1}^{t} X_i > pt + \epsilon t \Big\}, \quad\text{and}\quad
p_{lower}(p, \epsilon, t) = \Pr\Big\{ \sum_{i=1}^{t} X_i < pt - \epsilon t \Big\}.$$

For these tail probabilities, several bounds have been used in the literature; here we
make use of the following ones (see, e.g., [?]).

Theorem 2.1. (Hoeffding bound)

For some constant $c_H > 0$, and for any $p$, $\epsilon$, and $t$, we have

$$p_{upper}(p, \epsilon, t) \le \exp(-c_H \epsilon^2 t), \quad\text{and}\quad p_{lower}(p, \epsilon, t) \le \exp(-c_H \epsilon^2 t).$$

Remark. The Hoeffding bound used in the literature uses $c_H = 2$. Later in this paper,
we will use other constants that are valid in specific situations.
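
As a quick sanity check of the Hoeffding bound with $c_H = 2$, the following Python sketch compares the exact upper tail of a sum of Bernoulli trials with the bound. The function names and the sample parameters are ours, chosen only for illustration; they do not appear in the paper.

```python
import math


def upper_tail_exact(p: float, eps: float, t: int) -> float:
    """Exact Pr{ X_1 + ... + X_t > p*t + eps*t } for i.i.d. Bernoulli(p) trials."""
    threshold = p * t + eps * t
    return sum(
        math.comb(t, k) * p**k * (1 - p) ** (t - k)
        for k in range(t + 1)
        if k > threshold
    )


def hoeffding_bound(eps: float, t: int, c_h: float = 2.0) -> float:
    """Right-hand side of the Hoeffding bound: exp(-c_H * eps^2 * t)."""
    return math.exp(-c_h * eps**2 * t)


if __name__ == "__main__":
    p, eps, t = 0.5, 0.1, 100  # illustrative values only
    print("exact tail:", upper_tail_exact(p, eps, t))
    print("bound     :", hoeffding_bound(eps, t))
```

For these particular values the exact tail is well below the bound, which illustrates why a larger constant than $c_H = 2$ can be valid for a specific parameter setting.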

By using this bound, we can estimate the number of examples sufficient to guarantee
that Batch Selection, the simple hypothesis selection algorithm, yields a hypothesis of
reasonable accuracy with high probability. (In the following, we use BS($\delta$, $\gamma$, $m$)
to denote the execution of Batch Selection for parameters $\delta$, $\gamma$, and $m$, the sample
size. Recall that the hypothesis space, its size, and the accuracy of the best hypothesis
are fixed, throughout this paper, to $\mathcal{H}$, $n$, and $1/2 + \gamma_0$.)

Theorem 2.2. For any $\gamma$ and $\delta$, $0 < \gamma, \delta < 1$, if $\gamma \le \gamma_0$ and
$m = 16\ln(2n/\delta)/(c_H \gamma^2)$, then with probability more than $1 - \delta$,
BS($\delta$, $\gamma$, $m$) yields some hypothesis $h$ with
$\mathrm{prc}_{\mathcal{D}}(h) \ge 1/2 + \gamma_0/2$.

Proof. Follows from the Hoeffding bound in Theorem 2.1. □
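
Batch Selection with the sample size of Theorem 2.2 can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the oracle `make_bernoulli_oracle` is a hypothetical stand-in for $EX_{\mathcal{D}}()$ that reports, for each hypothesis, whether it is correct on a freshly drawn example (independently across hypotheses), and the sample size is rounded up, whereas the paper ignores rounding.

```python
import math
import random
from typing import Callable, List

Oracle = Callable[[], List[bool]]


def make_bernoulli_oracle(accuracies: List[float], rng: random.Random) -> Oracle:
    """Hypothetical stand-in for EX_D(): one call = one example; entry i says
    whether hypothesis i predicts that example correctly."""
    return lambda: [rng.random() < a for a in accuracies]


def bs_sample_size(n: int, delta: float, gamma: float, c_h: float = 2.0) -> int:
    """m = 16 ln(2n/delta) / (c_H gamma^2), rounded up to an integer."""
    return math.ceil(16 * math.log(2 * n / delta) / (c_h * gamma**2))


def batch_selection(draw: Oracle, n: int, m: int) -> int:
    """Draw m examples, then return the index of the most successful hypothesis."""
    successes = [0] * n
    for _ in range(m):
        for i, ok in enumerate(draw()):
            successes[i] += ok
    return max(range(n), key=lambda i: successes[i])


if __name__ == "__main__":
    accuracies = [0.8, 0.5, 0.5]  # illustrative: gamma_0 = 0.3
    m = bs_sample_size(len(accuracies), delta=0.01, gamma=0.3)
    oracle = make_bernoulli_oracle(accuracies, random.Random(0))
    print(m, batch_selection(oracle, len(accuracies), m))
```

Note that `gamma` must be supplied in advance; if it underestimates $\gamma_0$, the sample size grows as $1/\gamma^2$ regardless of how good the best hypothesis actually is.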



3 On-line Selection Algorithms and Their Analysis 

Here we present our two on-line selection algorithms and investigate their reliability and
efficiency theoretically. In our analysis of the algorithms we count each while-iteration
as one step; thus, the number of steps is equal to the number of examples needed by the
algorithm. By "at the $t$th step" we precisely mean "at the point just after the $t$th
while-iteration." Throughout this section, we denote by $\#_t(h)$ the number of examples for
which the hypothesis $h$ succeeds within $t$ steps. It will also be useful for our analysis to
partition the hypothesis space into two sets depending on the precision of each hypothesis.
Thus, let $\mathcal{H}_{good}$ (resp., $\mathcal{H}_{bad}$) denote the set of hypotheses $h$ such
that $\mathrm{prc}_{\mathcal{D}}(h) \ge 1/2 + \gamma_0/2$ (resp.,
$\mathrm{prc}_{\mathcal{D}}(h) < 1/2 + \gamma_0/2$). This partition can be done in an arbitrary
way. The complexity of our algorithms depends on it, but they can easily be adapted to a more
restrictive condition (for instance, $h \in \mathcal{H}_{good}$ if
$\mathrm{prc}_{\mathcal{D}}(h) \ge 1/2 + 3\gamma_0/4$) if it is needed for a particular
application. Obviously, the more demanding the definition of $\mathcal{H}_{good}$, the greater
the complexity of our algorithms.

In our analysis, we ignore small differences caused by taking the ceiling or floor function,
or by computing real numbers with finite precision.

3.1 Constrained Selection Algorithm 

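We begin by introducing a function that is used to determine an important parameter of
our algorithm. For a given $n$, $\delta$, and $\gamma$, define $b_{CS}(n, \delta, \gamma)$ by

$$b_{CS}(n, \delta, \gamma) = \frac{16}{c_H \gamma^2} \ln\left( \frac{2n}{\delta} \cdot \frac{16e}{c_H (e-1) \gamma^2} \right) = \frac{16}{c_H \gamma^2} \ln\left( \frac{32 e n}{c_H (e-1) \gamma^2 \delta} \right).$$

Now our first algorithm, which we denote by CS (for constrained selection), is stated as
follows.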

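Algorithm CS($\delta$, $\gamma$)
    $B \leftarrow 3\gamma\, b_{CS}(n, \delta, \gamma)/4$;
    set $w(h) \leftarrow 0$ for all $h \in \mathcal{H}$;
    while $\forall h \in \mathcal{H}$ [ $w(h) < B$ ] do
        $(x, b) \leftarrow EX_{\mathcal{D}}()$;
        $\mathcal{H}' \leftarrow \{ h \in \mathcal{H} : h(x) = b \}$; $n' \leftarrow |\mathcal{H}'|$;
        for each $h \in \mathcal{H}$ do
            if $h \in \mathcal{H}'$ then $w(h) \leftarrow w(h) + 1 - n'/n$;
            else $w(h) \leftarrow w(h) - n'/n$;
        end-for
    end-while
    output $h \in \mathcal{H}$ with the largest $w(h)$;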

Note that the number $n'$ of successful hypotheses may vary at each step, which makes
our analysis difficult. To avoid this difficulty, we approximate $n'$ by $n/2$; that is, we
assume that half of the hypotheses in $\mathcal{H}$ always succeed on a given example. In
other words, we assume the following.

Assumption. After $t$ steps (i.e., after $t$ while-iterations), the following holds for each
$h \in \mathcal{H}$:

$$w(h) = \#_t(h) - t/2.$$



Remark. In fact, we can modify CS so that it satisfies this assumption; that is, we use a
fixed decrement term of 1/2 instead of $n'/n$. As our experiments show, both algorithms
seem to have almost the same reliability, while the modified algorithm has more stable
complexity. We believe, however, that the original algorithm is more efficient in many
practical applications. (See the next section for our experiments and discussion.)
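
Algorithm CS and the fixed-decrement variant from the remark can be sketched in Python as follows. This is an illustration under our own assumptions, not the authors' implementation: the oracle is a hypothetical stand-in for $EX_{\mathcal{D}}()$ that reports per-hypothesis correctness directly, and the threshold helper `cs_bound` uses the simplified form $16\ln(2n/\delta)/(c_H\gamma^2)$ of $b_{CS}$ that the paper adopts later, rather than the longer definition given above.

```python
import math
import random
from typing import Callable, List, Tuple

Oracle = Callable[[], List[bool]]


def make_bernoulli_oracle(accuracies: List[float], rng: random.Random) -> Oracle:
    """Hypothetical stand-in for EX_D(): entry i of the result says whether
    hypothesis i is correct on the freshly drawn example."""
    return lambda: [rng.random() < a for a in accuracies]


def cs_bound(n: int, delta: float, gamma: float, c_h: float = 2.0) -> float:
    """B = 3*gamma*b_CS/4 with the simplified b_CS = 16 ln(2n/delta)/(c_H gamma^2)."""
    b_cs = 16 * math.log(2 * n / delta) / (c_h * gamma**2)
    return 3 * gamma * b_cs / 4


def constrained_selection(
    draw: Oracle, n: int, bound: float, fixed_dec: bool = False
) -> Tuple[int, int]:
    """Run CS until some weight reaches `bound`; return (chosen index, steps)."""
    w = [0.0] * n
    steps = 0
    while all(weight < bound for weight in w):
        outcomes = draw()
        # Decrement term: n'/n in the original CS, the constant 1/2 in the variant.
        dec = 0.5 if fixed_dec else sum(outcomes) / n
        for i, ok in enumerate(outcomes):
            w[i] += (1.0 if ok else 0.0) - dec
        steps += 1
    return max(range(n), key=lambda i: w[i]), steps


if __name__ == "__main__":
    accuracies = [0.8, 0.5, 0.5, 0.5, 0.5]  # illustrative: gamma_0 = 0.3
    n = len(accuracies)
    bound = cs_bound(n, delta=0.01, gamma=0.3)
    for fixed in (False, True):
        oracle = make_bernoulli_oracle(accuracies, random.Random(0))
        print(fixed, constrained_selection(oracle, n, bound, fixed_dec=fixed))
```

Because a weight grows by less than 1 per step, the number of steps is always at least $B$; how much larger it is depends on the drift of the best hypothesis, which is what the complexity discussion later in this section estimates.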

First we investigate the reliability of this algorithm. 

Theorem 3.1. For any $\gamma$ and $\delta$, $0 < \gamma, \delta < 1$, if $\gamma \le \gamma_0$, then with
probability more than $1 - \delta$, CS($\delta$, $\gamma$) yields some hypothesis $h \in \mathcal{H}_{good}$.

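Proof. We estimate the error probability $P_{err}$, i.e., the probability that CS chooses some
hypothesis with $\mathrm{prc}_{\mathcal{D}}(h) < 1/2 + \gamma_0/2$, and show that it is less than
$\delta$, in the following way.

$$\begin{aligned}
P_{err} &= \Pr_{CS}\Big\{ \bigcup_{t \ge 1} [\, \text{CS stops at the $t$th step and yields some } h \in \mathcal{H}_{bad} \,] \Big\} \\
&\le \Pr_{CS}\Big\{ \bigcup_{t \ge 1} \big[\, \exists h \in \mathcal{H}_{bad}\, [\, w(h) \text{ reaches $B$ at the $t$th step (for the first time)} \,] \\
&\qquad\qquad\quad \wedge\ \forall h \in \mathcal{H}_{good}\, [\, w(h) \text{ has not reached $B$ within $t-1$ steps} \,] \,\big] \Big\} \\
&\le \sum_{h \in \mathcal{H}_{bad}} \Pr_{CS}\Big\{ \bigcup_{t \ge 1} \big[\, [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \\
&\qquad\qquad\quad \wedge\ [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \,\big] \Big\}.
\end{aligned}$$

Let $t_0 = b_{CS}(n, \delta, \gamma)$ and $\tilde{t}_0 = (\gamma/\gamma_0)\, t_0$. (Note that
$\tilde{t}_0 \le t_0$.) We estimate the above probability considering two cases:
$t \le \tilde{t}_0$ and $t \ge \tilde{t}_0 + 1$. That is, we consider the following
two probabilities.

$$P_1(h) = \Pr_{CS}\Big\{ \bigcup_{t \le \tilde{t}_0} \big[\, [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \wedge [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \,\big] \Big\}, \text{ and}$$

$$P_2(h) = \Pr_{CS}\Big\{ \bigcup_{\tilde{t}_0 + 1 \le t} \big[\, [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \wedge [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \,\big] \Big\}.$$

In Lemma 3.2 and Lemma 3.3 below, we prove that both $P_1(h)$ and $P_2(h)$ are bounded
by $\delta/2n$ for any $h \in \mathcal{H}_{bad}$. Therefore we have

$$P_{err} \le \sum_{h \in \mathcal{H}_{bad}} \big( P_1(h) + P_2(h) \big) \le n \left( \frac{\delta}{2n} + \frac{\delta}{2n} \right) = \delta.$$

□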
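Lemma 3.2. For any $h \in \mathcal{H}_{bad}$, we have $P_1(h) \le \delta/2n$.

Proof. We bound the probability
$P_1'(h) = \Pr_{CS}\{ \bigcup_{t \le \tilde{t}_0} [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \}$.
Clearly $P_1(h) \le P_1'(h)$.

The probability $P_1'(h)$ is in fact the same as the probability that $w(h)$ reaches $B$
in $\tilde{t}_0$ steps. Now suppose that $w(h)$ reaches $B$ in $\tilde{t}_0$ steps. Then for some
$t \le \tilde{t}_0$, $w(h) \ge B$ at the $t$th step (i.e., just after the $t$th step). From our
assumption, we have $w(h) = \#_t(h) - t/2$ at the $t$th step. Also recall that
$B = 3\gamma t_0/4$ and that $E[\#_t(h)] < t/2 + \gamma_0 t/2$ (since $h \in \mathcal{H}_{bad}$). Hence,

$$\begin{aligned}
w(h) \ge B \text{ at the $t$th step}
&\iff \#_t(h) - t/2 \ \ge\ B = 3\gamma t_0/4 = 3\gamma_0 \tilde{t}_0/4 \\
&\iff \#_t(h) \ \ge\ E[\#_t(h)] + \big( t/2 + 3\gamma_0 \tilde{t}_0/4 - E[\#_t(h)] \big) \\
&\implies \#_t(h) \ >\ E[\#_t(h)] + \big( 3\gamma_0 \tilde{t}_0/4 - \gamma_0 t/2 \big) \ \ge\ E[\#_t(h)] + \gamma_0 \tilde{t}_0/4.
\end{aligned}$$

Therefore, if $w(h)$ reaches $B$ within $\tilde{t}_0$ steps, then
$\#_t(h) > E[\#_t(h)] + \gamma_0 \tilde{t}_0/4$ for some $t \le \tilde{t}_0$. Hence, by using the
Hoeffding bound (Theorem 2.1), we can derive the following bound.
(Here recall that $\gamma \le \gamma_0$ and $\tilde{t}_0 = (\gamma/\gamma_0)\, t_0$.)

$$P_1'(h) \le \exp\left( -c_H \left( \frac{\gamma_0 \tilde{t}_0}{4t} \right)^2 t \right) \le \exp\left( -\frac{c_H \gamma_0^2 \tilde{t}_0}{16} \right) \le \exp\left( -\frac{c_H \gamma^2 t_0}{16} \right).$$

On the other hand, by our choice of $t_0$ (i.e., $b_{CS}$), we have
$\exp(-c_H \gamma^2 t_0/16) < \delta/2n$. □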



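Lemma 3.3. $P_2(h) \le \delta/2n$.

Proof. First we note the following.

$$\begin{aligned}
P_2(h) &= \Pr_{CS}\Big\{ \bigcup_{\tilde{t}_0 + 1 \le t} \big[\, [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \wedge [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \,\big] \Big\} \\
&\le \Pr_{CS}\Big\{ \bigcup_{\tilde{t}_0 + 1 \le t} [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \Big\} \\
&\le \sum_{\tilde{t}_0 + 1 \le t} \Pr_{CS}\{\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,\}.
\end{aligned}$$

Thus, we estimate the probability
$P_2'(h_0, t) = \Pr_{CS}\{\, w(h_0) \text{ has not reached $B$ in $t$ steps} \,\}$
for each $t \ge \tilde{t}_0$.

Here we modify CS slightly (we call the result CS') so that it does not terminate even if
some of the weights reach $B$, and let $w_t(h)$ denote the weight of $h$ at the $t$th step
in the execution of CS'. Note that if $w(h_0)$ has not reached $B$ in CS within $t$ steps
(including the $t$th step), then $w_t(h_0) < B$ in CS'. On the other hand, we have

$$\begin{aligned}
w_t(h_0) < B &\iff \#_t(h_0) - t/2 \ <\ B = 3\gamma t_0/4 = 3\gamma_0 \tilde{t}_0/4 \\
&\iff \#_t(h_0) \ <\ E[\#_t(h_0)] + \big( t/2 + 3\gamma_0 \tilde{t}_0/4 - E[\#_t(h_0)] \big) \\
&\iff \#_t(h_0) \ <\ E[\#_t(h_0)] + \big( 3\gamma_0 \tilde{t}_0/4 - \gamma_0 t \big) \ \le\ E[\#_t(h_0)] - \gamma_0 t/4.
\end{aligned}$$

Therefore, if $w(h_0)$ has not reached $B$ in $t$ steps in CS, then
$\#_t(h_0) < E[\#_t(h_0)] - \gamma_0 t/4$ in CS'. Hence, by using the Hoeffding bound again, we get
$P_2'(h_0, t) \le \exp(-c_H \gamma_0^2 t/16)$.

Now we estimate $\sum_{\tilde{t}_0 + 1 \le t} P_2'(h_0, t-1)$. First, for any $\Delta \ge 0$, consider
$P_2'(h_0, \tilde{t}_0 + \Delta)$. From the above, we have
$P_2'(h_0, \tilde{t}_0 + \Delta) \le P_0 \cdot \exp(-(c_H \gamma_0^2/16)\Delta)$, where
$P_0 = \exp(-c_H \gamma_0^2 \tilde{t}_0/16)$. Hence, if $\Delta \ge 16/(c_H \gamma_0^2)$, then
$P_2'(h_0, \tilde{t}_0 + \Delta) \le P_0 \cdot e^{-1}$. In general, if
$\Delta \ge k\,(16/(c_H \gamma_0^2))$, then $P_2'(h_0, \tilde{t}_0 + \Delta) \le P_0 \cdot e^{-k}$.
Therefore we have (see footnote 1)

$$\begin{aligned}
\sum_{t \ge \tilde{t}_0} P_2'(h_0, t) = \sum_{\Delta \ge 0} P_2'(h_0, \tilde{t}_0 + \Delta)
&\le \frac{16}{c_H \gamma_0^2} \cdot P_0 \cdot \frac{1}{1 - e^{-1}} \\
&\le \frac{\delta}{2n} \cdot \frac{c_H (e-1) \gamma^2}{16e} \cdot \frac{16e}{c_H (e-1) \gamma_0^2} \ \le\ \frac{\delta}{2n}.
\end{aligned}$$

(Note that $P_0 \le \exp(-c_H \gamma^2 t_0/16)$, which is at most
$(\delta/2n)(c_H (e-1) \gamma^2/16e)$ by our choice of $t_0$ (i.e., $b_{CS}$).) □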

Though valid, our estimation of error probability is not tight, and it may not give us 
a useful bound B for practical applications. Here under a certain assumption (i.e., the 
independence of hypotheses), we can derive a much better formula for computing B. 

Theorem 3.4. Consider a modification of CS, where we use the following definition for
$b_{CS}$.

$$b_{CS}(n, \delta, \gamma) = \frac{16 \ln(2n/\delta)}{c_H \gamma^2}.$$

Assume that for any $h$ and $h'$, the correctness of $h$ on a randomly given example $x$ is
independent from that of $h'$. (See the proof below for the precise condition.) Then we
can show the same reliability as in Theorem 3.1 for the modified algorithm.

Proof. It is easy to see that the new $b_{CS}$ is good enough for showing Lemma 3.2 (i.e.,
$P_1(h) \le \delta/2n$); on the other hand, the proof of Lemma 3.3 requires the previous
$b_{CS}$. Thus, we redo the estimation of $P_2(h)$.
This time we bound $P_2$ as follows.

$$\begin{aligned}
P_2(h) &= \Pr_{CS}\Big\{ \bigcup_{\tilde{t}_0 + 1 \le t} \big[\, [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \wedge [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \,\big] \Big\} \\
&\le \sum_{\tilde{t}_0 + 1 \le t} \Pr_{CS}\big\{ [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \wedge [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \big\}.
\end{aligned}$$

Now we use our assumption, the independence of hypotheses; more specifically, we
assume, for any $h \in \mathcal{H}_{bad}$, that
$\Pr\{ [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \wedge [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \}$
$= \Pr\{ [\, w(h) \text{ reaches $B$ within $t$ steps} \,] \} \times \Pr\{ [\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,] \}$.
Then from the above, we obtain the following bound.

$$P_2(h) \le \sum_{\tilde{t}_0 + 1 \le t} \Pr_{CS}\{\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,\} \times \Pr_{CS}\{\, w(h) \text{ reaches $B$ within $t$ steps} \,\}.$$

On the other hand, we can show that, for any $t \ge \tilde{t}_0 + 1$,
$\Pr_{CS}\{\, w(h_0) \text{ has not reached $B$ within $t-1$ steps} \,\} \le \delta/2n$.
(See the proof of Lemma 3.3.) Therefore, we have

$$P_2(h) \le \sum_{\tilde{t}_0 + 1 \le t} \frac{\delta}{2n} \times \Pr_{CS}\{\, w(h) \text{ reaches $B$ within $t$ steps} \,\} \le \frac{\delta}{2n}.$$

□

1 Precisely speaking, the factor $8/\gamma_0^2$ should be $\lceil 8/\gamma_0^2 \rceil$; but since the
effect of the ceiling function is negligible, we omit it to simplify our discussion.

It may be unlikely that $h_0$ is independent from all hypotheses in $\mathcal{H}_{bad}$. We may
reasonably assume, however, that for any $h \in \mathcal{H}_{bad}$, there exists some
$h' \in \mathcal{H}_{good}$ such that $h$ and $h'$ are (approximately) independent, and our proof
above works similarly under such an assumption. Thus, in most cases, we may safely use the
simplified version of $b_{CS}$, and we will use it in the following discussion.

Next let us discuss the complexity of our algorithm CS. Here by "complexity" we
mean the number of steps that CS($\delta$, $\gamma$) needs to yield a hypothesis, or in other words,
the number of examples used to select a hypothesis.

Consider the execution of CS on some $\delta > 0$ and $\gamma \le \gamma_0$. It is easy to see that,
after $t$ steps, the weight of $h_0$ becomes $\gamma_0 t$ on average. Thus, on average, the weight
reaches $B$ in $B/\gamma_0$ steps (see footnote 2). From this observation, we may use the
following function for the average complexity of CS($\delta$, $\gamma$).

$$t_{CS}(n, \delta, \gamma, \gamma_0) = \frac{B}{\gamma_0} = \frac{12 \ln(2n/\delta)}{c_H \gamma \gamma_0}.$$

2 Precisely speaking, our argument is not mathematically correct, because we estimate here
$\min\{ t \mid E[w_t(h_0)] \ge B \}$, whereas what we need to estimate is
$E[\min\{ t \mid w_t(h_0) \ge B \}]$.

3.2 Adaptive Selection Algorithm

In this section we give a different algorithm that does not use any knowledge about the
accuracy of the best hypothesis in the class (recall that algorithm CS used the knowledge
of a lower bound on $\gamma_0$). To achieve this goal, we modify the condition of the while-loop
so that it changes adaptively according to the number of examples we are collecting. We
call the algorithm AS (for adaptive selection). The algorithm is stated as follows.

Algorithm AS($\delta$)
    $S \leftarrow \emptyset$; $t \leftarrow 0$; $\epsilon \leftarrow 1/5$;
    while $\forall h \in \mathcal{H}$ [ $\#_t(h) \le t/2 + 5t\epsilon/2$ ] do
        $(x, b) \leftarrow EX_{\mathcal{D}}()$;
        $S \leftarrow S \cup \{(x, b)\}$; $t \leftarrow t + 1$;
        $\epsilon \leftarrow \sqrt{4\ln(3n/\delta)/(c_H t)}$;
    end-while
    output $h \in \mathcal{H}$ with the largest $\#_t(h)$;

Remark. The condition of the while-loop is trivially satisfied until the algorithm collects
a sufficient number of examples for $S$, i.e., until $|S| > 4\ln(3n/\delta)/(c_H (1/5)^2)$. Thus,
in practice, we start the while-loop after obtaining $4\ln(3n/\delta)/(c_H (1/5)^2)$ examples
for $S$.
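
Algorithm AS can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the oracle is a hypothetical stand-in for $EX_{\mathcal{D}}()$ that reports per-hypothesis correctness, and instead of storing the sample $S$ the sketch keeps only the success counts $\#_t(h)$, which is all the stopping rule needs.

```python
import math
import random
from typing import Callable, List, Tuple

Oracle = Callable[[], List[bool]]


def make_bernoulli_oracle(accuracies: List[float], rng: random.Random) -> Oracle:
    """Hypothetical stand-in for EX_D(): entry i of the result says whether
    hypothesis i is correct on the freshly drawn example."""
    return lambda: [rng.random() < a for a in accuracies]


def adaptive_selection(
    draw: Oracle, n: int, delta: float, c_h: float = 2.0
) -> Tuple[int, int]:
    """Run AS; return (index of the chosen hypothesis, number of examples used)."""
    counts = [0] * n  # counts[i] plays the role of #_t(h_i)
    t = 0
    eps = 1 / 5
    log_term = math.log(3 * n / delta)
    # Continue while no hypothesis exceeds the adaptive threshold t/2 + 5*t*eps/2.
    while all(c <= t / 2 + 5 * t * eps / 2 for c in counts):
        for i, ok in enumerate(draw()):
            counts[i] += ok
        t += 1
        eps = math.sqrt(4 * log_term / (c_h * t))
    return max(range(n), key=lambda i: counts[i]), t


if __name__ == "__main__":
    accuracies = [0.8, 0.5, 0.5, 0.5, 0.5]  # illustrative: gamma_0 = 0.3
    oracle = make_bernoulli_oracle(accuracies, random.Random(0))
    print(adaptive_selection(oracle, len(accuracies), delta=0.01))
```

Unlike Batch Selection and CS, the only inputs are $\delta$ and the hypothesis count; no estimate of $\gamma_0$ is supplied. The loop cannot stop while $\epsilon \ge 1/5$, which is the observation made in the remark above.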

Again we begin by investigating the reliability of this algorithm. 

Theorem 3.5. For any $\delta$, $0 < \delta < 1$, with probability more than $1 - \delta$, AS($\delta$)
yields some hypothesis $h \in \mathcal{H}_{good}$.

Proof. Our goal is to show that when the algorithm stops it outputs a hypothesis
$h \in \mathcal{H}_{good}$ with probability more than $1 - \delta$. That is, we want to show that the
following probability is larger than $1 - \delta$.

$$\begin{aligned}
P_{crct} &= \Pr_{AS}\Big\{ \bigcup_{t \ge 1} [\, \text{AS stops at the $t$th step and yields some } h \in \mathcal{H}_{good} \,] \Big\} \\
&= \sum_{t \ge 1} \Pr_{AS}\{\, \text{AS stops at the $t$th step and yields some } h \in \mathcal{H}_{good} \,\} \\
&= \sum_{t \ge 1} \Pr_{AS}\{\, \text{AS yields some } h \in \mathcal{H}_{good} \mid \text{AS stops at the $t$th step} \,\} \times \Pr_{AS}\{\, \text{AS stops at the $t$th step} \,\}.
\end{aligned}$$

Consider any $t \ge 1$, and assume in the following that the algorithm stops at the $t$th
step, i.e., just after the $t$th while-iteration. (Thus, we discuss here probability under the
condition that AS stops at the $t$th step.) Let $\epsilon_t$ and $S_t$ be the values of $\epsilon$
and $S$ at the $t$th step. Also let $h$ be the hypothesis that AS yields; that is, $\#_t(h)$ is the
largest at the $t$th step.

By our choice of $\epsilon_t$, we know that $t = 4\ln(3n/\delta)/(c_H \epsilon_t^2)$, and thus, by
Lemma 3.6 given below, the following inequalities hold with probability $> 1 - \delta$.

$$\mathrm{prc}_{\mathcal{D}}(h_0) \le \mathrm{prc}_{\mathcal{D}}(h) + \epsilon_t, \quad\text{and}\quad |\mathrm{prc}_{\mathcal{D}}(h) - \#_t(h)/t| \le \epsilon_t/2.$$

From the second inequality, we have that
$\#_t(h) \le t(\epsilon_t/2 + \mathrm{prc}_{\mathcal{D}}(h))$, and since we know that
$1/2 + \gamma_0 = \mathrm{prc}_{\mathcal{D}}(h_0) \ge \mathrm{prc}_{\mathcal{D}}(h)$, we get that
$\#_t(h) \le t/2 + t\gamma_0 + t\epsilon_t/2$. Moreover, since the algorithm stopped, the condition
of the while-loop is not satisfied, and thus the following holds.

$$t/2 + 5t\epsilon_t/2 < \#_t(h) \le t/2 + t\gamma_0 + t\epsilon_t/2.$$

This implies that $\epsilon_t < \gamma_0/2$. With this fact together with the first inequality above
(i.e., $\mathrm{prc}_{\mathcal{D}}(h_0) \le \mathrm{prc}_{\mathcal{D}}(h) + \epsilon_t$), we can conclude
that $1/2 + \gamma_0/2 < \mathrm{prc}_{\mathcal{D}}(h)$.

Therefore, for any $t \ge 1$, we have
$\Pr_{AS}\{\, \text{AS yields some } h \in \mathcal{H}_{good} \mid \text{AS stops at the $t$th step} \,\} > 1 - \delta$.
This, together with the fact that $\sum_{t \ge 1} \Pr_{AS}\{\, \text{AS stops at the $t$th step} \,\} = 1$,
proves the theorem. □

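Lemma 3.6. For a given $\epsilon$, $0 < \epsilon \le 1$, let $t = 4\ln(3n/\delta)/(c_H \epsilon^2)$, and
consider the point in the execution of the algorithm just after the $t$th step. Then for any
$h \in \mathcal{H}$ such that $\#_t(h) \ge \#_t(h_0)$ we have

$$\Pr\{\, [\, \mathrm{prc}_{\mathcal{D}}(h_0) \le \mathrm{prc}_{\mathcal{D}}(h) + \epsilon \,] \wedge [\, |\mathrm{prc}_{\mathcal{D}}(h) - \#_t(h)/t| \le \epsilon/2 \,] \,\} > 1 - \delta.$$

Proof. Fix any $h \in \mathcal{H}$ such that $\#_t(h) \ge \#_t(h_0)$, and let $A(h)$ and $B(h)$ denote
the following conditions.

$$\begin{aligned}
A(h) &\iff [\, \mathrm{prc}_{\mathcal{D}}(h_0) \le \mathrm{prc}_{\mathcal{D}}(h) + \epsilon \,] \wedge [\, |\mathrm{prc}_{\mathcal{D}}(h) - \#_t(h)/t| \le \epsilon/2 \,], \text{ and} \\
B(h) &\iff [\, \mathrm{prc}_{\mathcal{D}}(h_0) - \#_t(h_0)/t \le \epsilon/2 \,] \wedge [\, |\mathrm{prc}_{\mathcal{D}}(h) - \#_t(h)/t| \le \epsilon/2 \,].
\end{aligned}$$

We first show that $B(h)$ implies $A(h)$. Notice that $B(h)$ implies that

$$[\, (\mathrm{prc}_{\mathcal{D}}(h_0) - \#_t(h_0)/t) + (\#_t(h)/t - \mathrm{prc}_{\mathcal{D}}(h)) \le \epsilon \,] \wedge [\, |\mathrm{prc}_{\mathcal{D}}(h) - \#_t(h)/t| \le \epsilon/2 \,].$$

Rewriting, we obtain that

$$[\, (\#_t(h)/t - \#_t(h_0)/t) + (\mathrm{prc}_{\mathcal{D}}(h_0) - \mathrm{prc}_{\mathcal{D}}(h)) \le \epsilon \,] \wedge [\, |\mathrm{prc}_{\mathcal{D}}(h) - \#_t(h)/t| \le \epsilon/2 \,],$$

and since $\#_t(h) \ge \#_t(h_0)$, it must hold that

$$[\, \mathrm{prc}_{\mathcal{D}}(h_0) \le \mathrm{prc}_{\mathcal{D}}(h) + \epsilon \,] \wedge [\, |\mathrm{prc}_{\mathcal{D}}(h) - \#_t(h)/t| \le \epsilon/2 \,],$$

which is condition $A(h)$.

Now we show that $\Pr_{AS}\{ \neg B(h) \} < \delta$. By the union bound and the Hoeffding
bound (Theorem 2.1), the probability over the choice of a sample $S$ of size $t$ (which is the
same as the probability over the execution of AS until the $t$th step) that there exists some
$h \in \mathcal{H}$ such that $B(h)$ does not hold is less than $3n\exp(-c_H(\epsilon/2)^2 t)$, which
is, by the choice of $t$, equal to $\delta$. Then since $\neg A(h) \Rightarrow \neg B(h)$, the lemma
follows. □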
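
The last step of the proof, that the failure probability equals $\delta$ by the choice of $t$, can be spelled out as a short calculation. It uses $t = 4\ln(3n/\delta)/(c_H\epsilon^2)$ from the statement of the lemma and reads the exponent with the negative sign required by Theorem 2.1; the factor $3n$ is taken from the proof as is.

```latex
3n \exp\!\left(-c_H \left(\frac{\epsilon}{2}\right)^{2} t\right)
  = 3n \exp\!\left(-\frac{c_H \epsilon^{2}}{4} \cdot \frac{4 \ln(3n/\delta)}{c_H \epsilon^{2}}\right)
  = 3n \exp\!\left(-\ln\frac{3n}{\delta}\right)
  = 3n \cdot \frac{\delta}{3n}
  = \delta .
```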

Next we discuss the complexity of the algorithm. Here we can prove the following 
bound. 

Theorem 3.7. For any $\delta$, $0 < \delta < 1$, with probability more than $1 - \delta$, AS($\delta$)
terminates within $64\ln(3n/\delta)/(c_H \gamma_0^2)$ steps.

Proof. Here we use the same notation as above. Notice first that while we are in the
while-loop, the value of $\epsilon$ is always strictly decreasing. Suppose that at some step $t$,
$\epsilon_t$ has become small enough so that $4\epsilon_t < \gamma_0$. Then from Lemma 3.6 (the
condition of the lemma always holds due to our choice of $\epsilon$), with probability
$> 1 - \delta$, we have that $t(\mathrm{prc}_{\mathcal{D}}(h) - \epsilon_t/2) \le \#_t(h)$ and
$\mathrm{prc}_{\mathcal{D}}(h_0) - \epsilon_t \le \mathrm{prc}_{\mathcal{D}}(h)$. Putting these two
inequalities together, we obtain that
$t/2 + t\gamma_0 - t\epsilon_t - t\epsilon_t/2 \le \#_t(h)$ (since
$\mathrm{prc}_{\mathcal{D}}(h_0) = 1/2 + \gamma_0$). Since we assumed that $4\epsilon_t < \gamma_0$,
we can conclude that $t/2 + 5t\epsilon_t/2 < \#_t(h)$, and thus the condition of the loop is
falsified. That is, the algorithm terminates (at the latest) after the $t$th while-iteration.

Recall that $\epsilon_t$ is defined to be $\sqrt{4\ln(3|\mathcal{H}|/\delta)/(c_H t)}$ at any step.
Thus, when we reach the $t$th step with $t = 64\ln(3|\mathcal{H}|/\delta)/(c_H \gamma_0^2)$, it must
hold that $\epsilon_t \le \gamma_0/4$, and by the above argument, the algorithm terminates with
probability larger than $1 - \delta$.

Remark. Thus, we use the following function for our theoretical bound on the number
of examples used by AS.

$$t_{AS}(n, \delta, \gamma_0) = \frac{64 \ln(3|\mathcal{H}|/\delta)}{c_H \gamma_0^2}.$$

□

Again this theoretical bound is not tight. As we will see in the next section, our
experiments show that the value of $\epsilon$, when the algorithm stops, is close to $\gamma_0/2$
instead of $\gamma_0/4$. Thus, the number of examples is much smaller than this theoretical bound.

4 Experimental Evaluation of the Algorithms



We first summarize the three selection algorithms considered, and state functions that bound
the number of examples sufficient to guarantee, in theory, that the algorithm selects with
probability $> 1 - \delta$ a hypothesis $h$ with
$\mathrm{prc}_{\mathcal{D}}(h) \ge 1/2 + \gamma_0/2$. (Recall that we assume that a given hypothesis
set $\mathcal{H}$ has some $h$ with $\mathrm{prc}_{\mathcal{D}}(h) \ge 1/2 + \gamma_0/2$.)

• Batch Selection: BS($n$, $\delta$, $\gamma$) (see Introduction)

Bound: $t_{BS}(n, \delta, \gamma) = 16\ln(2n/\delta)/(c_H \gamma^2)$ in the worst case. Condition: $\gamma \le \gamma_0$.

• Constrained Selection: CS($n$, $\delta$, $\gamma$)

Bound: $t_{CS}(n, \delta, \gamma) = 12\ln(2n/\delta)/(c_H \gamma \gamma_0)$ on average. Condition: $\gamma \le \gamma_0$.

• Adaptive Selection: AS($n$, $\delta$)

Bound: $t_{AS}(n, \delta) = 64\ln(3n/\delta)/(c_H \gamma_0^2)$ in the worst case. Condition: None.

Thus, for example, if we know $\gamma_0$ and use it as $\gamma$, then $t_{BS}(n, \delta, \gamma)$
examples are enough to guarantee $1 - \delta$ confidence for BS. We compare these theoretical
bounds with the numbers that we obtained through experiments.
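
The three bounds summarized above are easy to evaluate numerically. The following Python sketch is only a calculator for the stated formulas; the function names and argument order are ours, and no rounding is applied.

```python
import math


def t_bs(n: int, delta: float, gamma: float, c_h: float = 2.0) -> float:
    """Worst-case bound for Batch Selection: 16 ln(2n/delta) / (c_H gamma^2)."""
    return 16 * math.log(2 * n / delta) / (c_h * gamma**2)


def t_cs(n: int, delta: float, gamma: float, gamma0: float, c_h: float = 2.0) -> float:
    """Average-case bound for CS: 12 ln(2n/delta) / (c_H gamma gamma_0)."""
    return 12 * math.log(2 * n / delta) / (c_h * gamma * gamma0)


def t_as(n: int, delta: float, gamma0: float, c_h: float = 2.0) -> float:
    """Worst-case bound for AS: 64 ln(3n/delta) / (c_H gamma_0^2)."""
    return 64 * math.log(3 * n / delta) / (c_h * gamma0**2)


if __name__ == "__main__":
    n, delta, gamma0 = 18, 0.01, 0.2
    for gamma in (0.2, 0.1, 0.05):  # progressively worse underestimates of gamma_0
        print(
            gamma,
            round(t_bs(n, delta, gamma, c_h=4.0)),
            round(t_cs(n, delta, gamma, gamma0, c_h=4.0)),
            round(t_as(n, delta, gamma0, c_h=4.0)),
        )
```

Running the loop shows the qualitative behavior discussed in the paper: as $\gamma$ drops below $\gamma_0$, the BS bound grows as $1/\gamma^2$, the CS bound only as $1/\gamma$, and the AS bound does not change.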

First we describe the setup used in our experiments. We decided to use synthetic
data instead of real datasets so that we can investigate our algorithms over a wider range of
parameter values. (In future work we are also planning to evaluate them with real data.)

The common fixed parameters involved in our experiments are $\delta$, the confidence
parameter, and $n$, the number of hypotheses in $\mathcal{H}$. Notice that these two parameters
appear inside a logarithm in the above bounds; thus, the results are not really affected by
modifying them.

In fact, we verified this experimentally, and based on those results we set them to
18 for $n$ and 0.01 for $\delta$; that is, we require confidence of 99%. The other parameter is
the accuracy of the best hypothesis, which is specified by $\gamma_0$. In our experiments the
value of $\gamma_0$ ranges from 0.04 to 0.3 with an increment of 0.01 (that is, the accuracy of the
best hypothesis ranges from 54% to 80% with an increment of 0.4%) and we have a total
of 65 different values. For each $\gamma_0$, we distributed the 18 hypotheses in 9 groups of 2
hypotheses, where the accuracy of the hypotheses in each group is set to $1/2 - \gamma_0$,
$1/2 - 3\gamma_0/4$, ..., $1/2 + 3\gamma_0/4$, $1/2 + \gamma_0$. The choice of the distribution of
hypothesis accuracies does not affect the performance of either BS or AS (because their
performance depends only on the accuracy of the best hypothesis). On the other hand, it seems
to affect the performance of CS. For this reason, we also tried other distributions of the
hypothesis accuracies for CS. For a random number generator, we used the one explained in [?].

For each set of parameters, we generated a success pattern for each hypothesis $h$.
A success pattern is a 0/1 string of 1000 bits that is used to determine whether the
hypothesis $h$ predicts correctly for a given example. That is, to simulate the behavior
of $h$ on examples from $EX_{\mathcal{D}}()$, we just draw a random number $i$ between 1 and 1000,
and decide that $h$ predicts correctly/wrongly on the current example if the $i$th bit of the
success pattern is 1/0. Finally, for every fixed setting of all the parameters, we ran this
experiment 30 times, i.e., ran each algorithm 30 times, and averaged the results. This is
what is reflected in the graphs shown throughout this section.
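
The success-pattern simulation described above can be sketched in Python as follows. The paper does not say how the 1000 bits are filled in; the sketch assumes, purely for illustration, that a pattern contains exactly round(accuracy × 1000) ones in shuffled order, and it uses Python's standard generator rather than the one used in the experiments.

```python
import random
from typing import List

PATTERN_LENGTH = 1000  # number of bits in a success pattern, as in the text


def make_success_pattern(
    accuracy: float, rng: random.Random, length: int = PATTERN_LENGTH
) -> List[int]:
    """Build a 0/1 pattern whose fraction of ones matches `accuracy` (assumption:
    exact count of ones, randomly shuffled)."""
    ones = round(accuracy * length)
    pattern = [1] * ones + [0] * (length - ones)
    rng.shuffle(pattern)
    return pattern


def predicts_correctly(pattern: List[int], rng: random.Random) -> bool:
    """Simulate h on one example: pick a random bit; 1 means a correct prediction."""
    i = rng.randrange(len(pattern))  # 0-based counterpart of "between 1 and 1000"
    return pattern[i] == 1


if __name__ == "__main__":
    rng = random.Random(0)
    pattern = make_success_pattern(0.7, rng)
    hits = sum(predicts_correctly(pattern, rng) for _ in range(10_000))
    print(hits / 10_000)  # empirical accuracy of the simulated hypothesis
```

With this construction the simulated hypothesis has exactly the intended accuracy in expectation, at a resolution of 0.1%.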

1. The Tightness of Theoretical Bounds 

Let us assume that we know the value of $\gamma_0$, not just a lower bound. Then, from the
bounds summarized above, one may think that, e.g., CS is more efficient than BS. It turned
out, however, that this is not the case. Our experiments show that the number of required
examples is similar among the three algorithms, and the difference lies in the tightness of
our theoretical bounds. Of course, this is for the case when $\gamma_0$ is known; see the
subsection below for a discussion of this issue.

We checked that the "necessary and sufficient" number of examples is proportional
to $1/\gamma_0^2$ (where $n$ and $\delta$ are fixed). Thus, we changed the parameter $c_H$ to get
the tightest bounds; that is, for each algorithm, we obtained the smallest $c_H$ with which the
algorithm does not make any mistake in 30 runs. The graph (a) of Figure 1 shows the number of
examples needed by the three algorithms with such almost optimal constants. There is not much
difference, in particular between CS and AS. Thus, the tightness of our estimation
seems to be the main factor in the difference between the theoretical bounds when $\gamma_0$ is
known.



[Figure 1: the number of examples vs. $\gamma_0$, with curves for Algorithm AS, Algorithm CS,
and Algorithm BS; axis label: accuracy of the best hypothesis. (a) With almost optimal
constants. (b) With $c_H = 4$.]

It is, however, impossible in real applications to estimate the optimal constant and get
the tightest bound. Nevertheless, we can still get a better bound by a simple calculation.
Recall that the Hoeffding bound is a general bound for tail probabilities of Bernoulli trials.
While it may be hard to improve the constant $c_H$ in general, we can numerically calculate
a better one for a given set of parameters. For instance, for our experiments, we can safely
use $c_H = 4$ instead of $c_H = 2$, which halves the bound; e.g., $t_{BS}(18, 0.01, 0.1)$ (so the
best hypothesis has 60% accuracy) is 6550 with $c_H = 2$ but 3275 with $c_H = 4$. The graph
(b) of Figure 1 shows the number of examples needed by the three algorithms with $c_H = 4$.
Thus, when using these algorithms, it is recommended to first estimate an appropriate
constant $c_H$, and use it in the algorithms. For such usage, CS is the most efficient for the
set of parameters we used.

2. Comparison of Three Algorithms 

The graph (b) of Figure 1 indicates that CS is best (at least within this range of
parameters) if $\gamma_0$ or a good approximation of it is known. The situation differs a lot if
we do not know $\gamma_0$. For example, if $\gamma_0 = 0.2$ but it is underestimated as 0.05, then
BS and CS need 13101 and 2308 examples, while AS needs only 1237 examples; thus, in that case
AS is the most efficient. This phenomenon is shown in Figure 2, where we fixed $\gamma_0$ to be
0.2 (so the accuracy of the best hypothesis is 70%), and we changed the value of the lower
bound $\gamma$ from 0.04 to 0.2. Algorithm AS is not affected by the value of $\gamma$, and hence it
uses the same number of examples (the horizontal line in the graph). With this graph we
can see that, for instance, when $\gamma$ ranges from 0.04 to 0.058, algorithm AS is the most
efficient, while from 0.058 to 0.2 algorithm CS becomes the best; but in any case, the
difference is not so big within this range of $\gamma$. On the other hand, the performance of BS
becomes considerably worse if we underestimate $\gamma_0$, and the number of examples needed by
this algorithm might become huge.

[Figure 2: $t$ vs. $\gamma$ ($\gamma_0 = 0.2$). Figure 3: $t/(B/\gamma_0)$ vs. $\gamma_0$.
($t$ denotes the number of examples.)]



3. CS: Constant dec vs. Variable dec 

To simplify our theoretical analysis, we assumed that dec (recall that dec was $n'/n$) is
the constant 1/2. In fact, there are two choices: either (i) to use constant dec, or (ii) to
use variable dec. We investigate whether this choice affects the performance of the algorithm
CS. We verified that it does not affect the reliability of CS at all. On the other hand, it
affects the efficiency of CS, i.e., the number of examples needed by CS.

Intuitively the following is clear: if the distribution of hypothesis accuracies is
symmetric (as in the above experiment), then the number of successful hypotheses, at each
step, is about $n/2$; thus, dec $\approx 1/2$, and the number of examples does not change between
(i) and (ii). On the other hand, if most of the hypotheses are better than 1/2 (resp., most
of the hypotheses are worse than 1/2), then the number of examples gets larger (resp.,
smaller) in (ii) than in (i). We verified this intuition experimentally. Figure 3 shows the
ratio between the number of examples and $B/\gamma_0$ (which is always close to 1 if dec = 1/2)
for three different distributions of hypothesis accuracies: symmetric, positively biased, and
negatively biased. Thus, when the distribution is negatively biased, which is the case in
many applications, we recommend using the original CS with variable dec.



4. AS: $\epsilon$ vs. $\gamma_0$, and the Theoretical Bound

From the theoretical analysis of Theorem 3.7, we obtained that the algorithm stops
with high probability when $\epsilon$ becomes smaller than $\gamma_0/4$. On the other hand, to
guarantee the correctness of our algorithm (Theorem 3.5), we just need to conclude that
$\epsilon$ is smaller than $\gamma_0/2$. This difference gets reflected in our theoretical bound
for the number of examples. Our experiments (see Figure 4) showed that the number of examples
is much smaller than the theoretical bound. The reason is that, in most cases, the algorithm
stops well before $\epsilon$ becomes as low as $\gamma_0/4$; it is more likely that AS stops as
soon as $\epsilon$ becomes slightly smaller than $\gamma_0/2$. Figure 5 reflects this phenomenon;
the final value of $\epsilon$ is closer to $\gamma_0/2$ than to $\gamma_0/4$. (It is in fact on the
$\gamma_0/2.38$ line.) If we assume that the final value of $\epsilon$ is about $\gamma_0/2.38$
then, by using the relation between $t$ and $\epsilon$, we can estimate the number of examples
as $4(2.38)^2\ln(3n/\delta)/(c_H \gamma_0^2)$.
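
The gap between the theoretical bound and this empirical estimate follows directly from the relation $\epsilon = \sqrt{4\ln(3n/\delta)/(c_H t)}$. The Python sketch below inverts that relation for the two final values of $\epsilon$ discussed above; the function name and the sample parameters ($n = 18$, $\delta = 0.01$, $\gamma_0 = 0.2$, $c_H = 4$) are chosen by us for illustration.

```python
import math


def examples_for_final_eps(n: int, delta: float, eps: float, c_h: float = 2.0) -> float:
    """Invert eps = sqrt(4 ln(3n/delta) / (c_H t)) to get t for a given final eps."""
    return 4 * math.log(3 * n / delta) / (c_h * eps**2)


def theory_vs_observed(n: int, delta: float, gamma0: float, c_h: float = 2.0):
    """Return (t at eps = gamma_0/4, t at eps = gamma_0/2.38, their ratio)."""
    t_theory = examples_for_final_eps(n, delta, gamma0 / 4, c_h)
    t_observed = examples_for_final_eps(n, delta, gamma0 / 2.38, c_h)
    return t_theory, t_observed, t_theory / t_observed


if __name__ == "__main__":
    print(theory_vs_observed(18, 0.01, 0.2, c_h=4.0))
```

Whatever the values of $n$, $\delta$, and $c_H$, the ratio is $(4/2.38)^2 \approx 2.8$, so under this assumption the worst-case bound overestimates the number of examples by roughly that factor.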